Variable selection using random forests

نویسندگان

  • Robin Genuer
  • Jean-Michel Poggi
  • Christine Tuleau-Malot
چکیده

This paper proposes, focusing on random forests, the increasingly used statistical method for classification and regression problems introduced by Leo Breiman in 2001, to investigate two classical issues of variable selection. The first one is to find important variables for interpretation and the second one is more restrictive and try to design a good prediction model. The main contribution is twofold: to provide some insights about the behavior of the variable importance index based on random forests and to propose a strategy involving a ranking of explanatory variables using the random forests score of importance and a stepwise ascending variable introduction strategy.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Airborne Lidar Feature Selection for Urban Classification Using Random Forests

Various multi-echo and Full-waveform (FW) lidar features can be processed. In this paper, multiple classifers are applied to lidar feature selection for urban scene classification. Random forests are used since they provide an accurate classification and run efficiently on large datasets. Moreover, they return measures of variable importance for each class. The feature selection is obtained by ...

متن کامل

High-Dimensional Variable Selection for Survival Data

The minimal depth of a maximal subtree is a dimensionless order statistic measuring the predictiveness of a variable in a survival tree. We derive the distribution of the minimal depth and use it for high-dimensional variable selection using random survival forests. In big p and small n problems (where p is the dimension and n is the sample size), the distribution of the minimal depth reveals a...

متن کامل

Random Forests: some methodological insights

This paper examines from an experimental perspective random forests, the increasingly used statistical method for classification and regression problems introduced by Leo Breiman in 2001. It first aims at confirming, known but sparse, advice for using random forests and at proposing some complementary remarks for both standard problems as well as high dimensional ones for which the number of va...

متن کامل

Random Forests-based Feature Selection for Land-use Classification Using Lidar Data and Orthoimagery

The development of lidar system, especially incorporated with high-resolution camera components, has shown great potential for urban classification. However, how to automatically select the best features for land-use classification is challenging. Random Forests, a newly developed machine learning algorithm, is receiving considerable attention in the field of image classification and pattern re...

متن کامل

Random forest Gini importance favours SNPs with large minor allele frequency: impact, sources and recommendations

The use of random forests is increasingly common in genetic association studies. The variable importance measure (VIM) that is automatically calculated as a by-product of the algorithm is often used to rank polymorphisms with respect to their ability to predict the investigated phenotype. Here, we investigate a characteristic of this methodology that may be considered as an important pitfall, n...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Pattern Recognition Letters

دوره 31  شماره 

صفحات  -

تاریخ انتشار 2010